Conversation

@dungba88
Contributor

@dungba88 dungba88 commented Nov 22, 2024

Description

fixes #13564

Added a new Query which wraps around KnnFloatVectorQuery and does re-ranking for a quantized index using full-precision vectors. The idea is to first run KnnFloatVectorQuery with an over-sampled k (x1.5, x2, x5, etc.), then re-rank the docs using the full-precision (original, non-quantized) vectors, and finally take the top k.

Questions:

  • Should we expose the target inside KnnFloatVectorQuery so that users don't need to pass the target twice? Currently it only exposes getTargetCopy(), which requires an array copy and is thus inefficient, but I assume the intention is to encapsulate the array so that it can't be modified from outside?
  • Maybe out of scope for this PR, but I'm curious what people think about using mlock to prevent the quantized vectors from being swapped out, as loading fp vectors (although only a small set per query) means there will be more pressure on RAM.

Usage:

KnnFloatVectorQuery knnQuery = ...; // create the KnnFloatVectorQuery with some over-sampled k
RerankKnnFloatVectorQuery query = new RerankKnnFloatVectorQuery(knnQuery, targetVector, k);
TopDocs topDocs = searcher.search(query, k);
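
A more concrete sketch of the above (the field name and oversample factor here are illustrative, not part of the API):

int k = 100;
// hypothetical field name; over-sample by 2x for the first pass
KnnFloatVectorQuery knnQuery = new KnnFloatVectorQuery("vector", targetVector, 2 * k);
RerankKnnFloatVectorQuery query = new RerankKnnFloatVectorQuery(knnQuery, targetVector, k);
TopDocs topDocs = searcher.search(query, k);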

@dungba88
Contributor Author

dungba88 commented Nov 22, 2024

The build fails with "The import org.apache.lucene.codecs.lucene100 cannot be resolved"; I thought this was already in mainline. Will check.

Edit: It has been moved to backward codecs. Will use something more stable.

@dungba88
Contributor Author

I have a preliminary benchmark here (top-k=100, fanout=0) using the Cohere 768 dataset.

[benchmark chart: latency vs. recall on Cohere 768, top-k=100, fanout=0]

Anyhow, I can see two things that should be addressed:

  • If we access the full-precision vectors, it will evict memory that was allocated (either through preloading or through mmap) for the quantized vectors (the main search phase) when there's not enough memory. Eventually, some percentage of the quantized index will be swapped out, which will slow down the search. If we have to load all full-precision vectors into memory, that kind of defeats the purpose of quantization. I'm wondering if there is a way to access full-precision vectors without interfering with the space of the quantized vectors.
  • The latency could be better. With oversample=1.5 (second dot) for 4_bit, we have around the same latency and recall as the baseline. Although one can argue that we save memory compared to the baseline, with the new access pattern of two-phase search that saving might be diminished. Otherwise it seems to have little benefit over just using plain HNSW.

}
Weight weight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1.0f);
HitQueue queue = new HitQueue(k, false);
for (var leaf : reader.leaves()) {
Contributor

Should this be switched to parallel execution similar to AbstractKnnVectorQuery?

Contributor Author

@dungba88 dungba88 Nov 27, 2024


Good question. I was using a single thread as a simple version to benchmark the latency first, since multi-threading could add some overhead as well. This class only does vector loading and similarity computation for a small set of vectors (k * oversample), so it's not as CPU-intensive as AbstractKnnVectorQuery.

I'll also try multi-threading and run the benchmark again. From the benchmark below, the re-ranking phase only adds a trivial amount of latency, so it might not help much. Also, the benchmark code seems to force-merge so there's only a single partition; we need to change it so that there are multiple partitions.

@dungba88
Contributor Author

dungba88 commented Nov 27, 2024

Edit: My previous benchmark was wrong because the vectors were corrupted.

The first benchmark shows the recall improvement for each oversample factor with reranking. It now aligns with what was produced in #13651.

[benchmark chart: recall for each oversample factor, with reranking]

The second benchmark compares the latency across all algorithms. We are still adding only a small amount of latency for the reranking phase.

[benchmark chart: latency across all algorithms]

For the last benchmark, I just ran oversampling without reranking, but still cut off at the original k (so it acts similarly to fanout). This is just to make sure that the reranking phase actually adds value. As expected, the recall does not improve much compared to reranking.

[benchmark chart: recall for oversampling without reranking]

NOTE: The dots in all benchmarks represent the oversample factor, with values of 1, 1.5, 2, 3, 4, 5. An oversample of 1 means no over-sampling. See https://github.com/mikemccand/luceneutil/blob/main/src/main/knn/KnnGraphTester.java#L833-L834

@dungba88 dungba88 changed the title Add Query for reranking KnnFloatVectorQuery Add Query for reranking KnnFloatVectorQuery with full-precision vectors Nov 27, 2024
@dungba88
Contributor Author

Also, this is the luceneutil branch I used for benchmarking: https://github.com/dungba88/luceneutil/tree/dungba88/two-phase-search, which incorporates the test for the BQ implementation by @benwtrent and the two-phase search.

Comment on lines 107 to 113
float expectedScore = VECTOR_SIMILARITY_FUNCTION.compare(targetVector, docVector);
Assert.assertEquals(
"Score does not match expected similarity for doc ord: " + scoreDoc.doc + ", id: " + id,
expectedScore,
scoreDoc.score,
1e-5);
}

We can test that the results are sorted by exact distance.

Maybe we can also test that the result of the same query with oversampling will be at least the same or better than without oversampling? By "better" I mean we should have higher recall. But I'm not sure if it's deterministic.


Thinking again, the docs should be sorted by ord, so my first point should be irrelevant.

Comment on lines 63 to 64
HitQueue queue = new HitQueue(k, false);
for (var leaf : reader.leaves()) {
Contributor

@shubhamvishu shubhamvishu Nov 28, 2024

Here we have access to IndexSearcher#getTaskExecutor and could use it to parallelize the work across segments (like we did earlier with some other query rewrites). But the HitQueue here isn't thread-safe. I don't know if using concurrency after making insertWithOverflow thread-safe would really be helpful, since it looks like the added cost is cheap? Or maybe it will be?

Contributor Author

That's right. In order to apply parallelism we need to use a per-segment queue, then merge them like in AbstractKnnVectorQuery.mergeLeafResults. I think the added latency is already low, but I still want to try it and see if it helps.
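
A rough sketch of that direction (not this PR's code), assuming a hypothetical per-leaf helper rerankLeaf(...) that fills a per-segment queue and returns its TopDocs:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.TaskExecutor;
import org.apache.lucene.search.TopDocs;

// run the rerank per segment on IndexSearcher's task executor, then merge the
// per-segment results like AbstractKnnVectorQuery.mergeLeafResults does
TaskExecutor executor = indexSearcher.getTaskExecutor();
List<Callable<TopDocs>> tasks = new ArrayList<>();
for (LeafReaderContext leaf : reader.leaves()) {
  // rerankLeaf is hypothetical: it loads full-precision vectors for this
  // segment's candidates and scores them into a per-segment HitQueue
  tasks.add(() -> rerankLeaf(leaf));
}
List<TopDocs> perLeaf = executor.invokeAll(tasks);
TopDocs merged = TopDocs.merge(k, perLeaf.toArray(TopDocs[]::new));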

@github-actions
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Dec 14, 2024
@mikemccand
Member

I think this is a nice overall approach, adding a new RerankKnnFloatVectorQuery that wraps a KNN query that used quantization to get the initial results.

It's reminiscent of Lucene's existing QueryRescorer, to implement multi-phased ranking, except that class doesn't wrap another Query... maybe it should (separately)!
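
For comparison, a minimal sketch of how the existing QueryRescorer is driven today (the queries, sizes, and combine policy below are illustrative):

TopDocs firstPass = searcher.search(firstPassQuery, 200);
TopDocs reranked =
    new QueryRescorer(secondPassQuery) {
      @Override
      protected float combine(
          float firstPassScore, boolean secondPassMatches, float secondPassScore) {
        // illustrative policy: prefer the second-pass score when it matched
        return secondPassMatches ? secondPassScore : firstPassScore;
      }
    }.rescore(searcher, firstPass, 100);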

I'm curious about your results here -- why is recall better for 1bit and 4bit than 7bit, when reranking?

@github-actions github-actions bot removed the Stale label May 1, 2025
@dungba88
Contributor Author

dungba88 commented May 5, 2025

I'm curious about #14009 (comment) -- why is recall better for 1bit and 4bit than 7bit, when reranking?

The graph is a bit confusing, but the dots are the oversample factors (from 1 to 5). If we compare recall at the same oversample, then 7-bit is always better or the same. The difference becomes smaller at higher oversample. E.g., at oversample=1, 7-bit has 20% higher recall than 1-bit, but at oversample=5 they are mostly the same.


* GITHUB#13285: Early terminate graph searches of AbstractVectorSimilarityQuery to follow timeout set from
IndexSearcher#setTimeout(QueryTimeout). (Kaival Parikh)
Contributor Author

Seems like my IDE automatically removes extra spaces. If there is an objection I'll revert that in the next rev, along with other feedback.

to speed up computing the number of hits when possible. (Lu Xugang, Luca Cavanna, Adrien Grand)

* LUCENE-10422: Monitor Improvements: `Monitor` can use a custom `Directory`
* LUCENE-10422: Monitor Improvements: `Monitor` can use a custom `Directory`
Contributor

Looks like a lot of unrelated changes, probably needs a merge from main?

Contributor Author

It's due to my IDE automatically removing extra white space; I will revert it in the next rev.

Member

This problem is fixed for all files in main now, and any new trailing whitespace will fail the CI build.

import org.apache.lucene.index.IndexReader;

/**
* A Query that re-scores another Query with a DoubleValueSource function and cut-off the results at
Contributor

This sounds generic to any query, but rewrite always returns a KnnFloatVectorQuery?

Contributor Author

@dungba88 dungba88 Jun 13, 2025

The createRewrittenQuery method would return a DocAndScoreQuery, which is currently internal to KNN, but from the client/API point of view it's the same as any other Query. Moreover, it only contains the doc IDs and their respective scores.

We can extract createRewrittenQuery into a separate class for more reusability if needed. I can put up a prerequisite PR if that makes sense.

@vigyasharma
Contributor

Thanks for the explanations, @dungba88. I suppose the scenario you're trying to solve for is when users want to change the matchset of a KnnVectorQuery using full-precision or other reranking. I'm open to this change if it's a valid requirement.

My perception is that it should be solvable by plumbing vector query results into a rescorer, then combining the top-k hits with other (lexical) hits. For example, is this also a problem for hybrid search in OpenSearch/Elasticsearch? I suspect they might have independent queries for both, with some way to combine results.

@dungba88
Contributor Author

is when users want to change the matchset of a KnnVectorQuery using full-precision or other reranking

Yes, that's correct, @vigyasharma. We are using a hybrid search where KnnFloatVectorQuery and TermQuery (amongst others) are combined into a single BooleanQuery. Thus it is important to be able to change the matchset of the KnnFloatVectorQuery individually.

For hybrid search in OpenSearch/Elasticsearch, I'm wondering if @jmazanec15 and @benwtrent have any input. I have a feeling that it's quite common to combine lexical + KNN matching into a single BooleanQuery.

@benwtrent
Member

kNN queries are completed in the rewrite phase; if any rescoring needs to be done, it should be done during that phase.

I would expect the experience to be:

RescoreQueryWithVectorQueryThingy(KnnQuery), and the rescore will occur during rewrite (or at least provide a scorer that iterates the kNN query results, calculating the higher-fidelity scores).

kNN should be "just another query" and should be combinable with any other query. I realize this is a bit tricky as kNN is unique in that it effectively "collects" its results up-front.

Contributor

@vigyasharma vigyasharma left a comment

Thanks for persisting on this @dungba88, changes look good. I have a few suggestions but this looks almost ready!

scoreDocs[i++] = topDoc;
}
TopDocs topDocs =
new TopDocs(new TotalHits(queue.size(), TotalHits.Relation.EQUAL_TO), scoreDocs);
Contributor

Instead of setting this to the configured n, should we retain the total number of hits and the relation from the original query?

Contributor Author

I have no preference, but I can't find where Scorer or Weight would expose the relation of the original query.

Contributor Author

I used the original result count as totalHits, but kept the relation as EQUAL_TO, as it doesn't seem to be exposed anywhere. Usually this would only be visible via search(query, n) with a TopDocsCollector.
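
i.e. roughly (originalCount standing in for the inner query's reported hit count):

// keep the inner query's hit count, but leave the relation as EQUAL_TO
TopDocs topDocs =
    new TopDocs(new TotalHits(originalCount, TotalHits.Relation.EQUAL_TO), scoreDocs);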

I also realized that this totalHitsRelation value is not used anywhere, so it would not matter.

@dungba88
Contributor Author

Sorry for spamming the replies! I should have gone to the Files changed tab, which allows sending all replies in the same message.

@jmazanec15
Contributor

For hybrid search in OpenSearch/Elastic Search, I'm wondering if @jmazanec15 and @benwtrent have any input. I'm having a feeling that it's quite common to combine lexical + KNN matching into a single BooleanQuery.

Not sure I'm following the discussion completely. It is common to combine lexical and k-NN in a boolean query, but I think there is a lot of variety in what/how users are implementing hybrid search, so flexibility is great to have. Like @benwtrent mentioned, result computation for k-NN is done upfront, but queries can also be used to re-score after the initial phase via the QueryRescorer (as @mikemccand mentioned a while ago), so I don't think a separate rescorer is necessary.

I also like the point around lazy iteration: "or at least provide a scorer that iterates the kNN query results calculating the higher fidelity scores". For expensive re-scoring (which I think multi-vector will be), this might be nice to have for hybrid/boolean queries too - I think this approach is taken in FloatVectorSimilarityQuery. But this can probably be left for future consideration.

@dungba88
Contributor Author

result computation for k-NN is done upfront, but queries can also be used to re-score after the initial phase via the QueryRescorer (as @mikemccand mentioned awhile ago), so I dont think a separate rescorer is necessary.

Rescorer can be used, but IIUC Rescorer works only in the collection phase: after the first-pass collection we would rescore the final results. This would not work if we combine semantic and lexical matching into a single Query; in that case we could only rescore the combined matches. Like @benwtrent mentioned, rescoring should be done in the rewrite phase. This works both when semantic and lexical matching are combined and when semantic matching is used alone.

"or atleast provide a scorer that iterates the kNN query results calculating the higher fidelity scores"

This is also handled by this PR: technically not a Scorer but a DoubleValuesSource. Users can use it either to rescore with full-precision vectors, or even to use another field for rescoring (amongst other use cases, as DoubleValuesSource is extensible). A potential idea is to have a 1-bit vector field for matching and another 4-bit or 7-bit vector field for rescoring. However, the quantization cost of the query vector, even for scalar 7-bit, is a bit high. We will tackle that as a future optimization to this new Query.
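
As a sketch of the two-field idea (the field names are hypothetical, and the exact RerankKnnFloatVectorQuery constructor taking a DoubleValuesSource is assumed here for illustration):

// match on a coarse 1-bit quantized field, rescore against a higher-fidelity field;
// DoubleValuesSource.similarityToQueryVector is the existing core factory
KnnFloatVectorQuery match = new KnnFloatVectorQuery("vector_1bit", target, 2 * k);
DoubleValuesSource rescore =
    DoubleValuesSource.similarityToQueryVector(target, "vector_7bit");
Query reranked = new RerankKnnFloatVectorQuery(match, rescore, k);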

Also thanks @vigyasharma for approving the PR! Can you help to merge it if there is no objection?

@vigyasharma
Contributor

Also thanks @vigyasharma for approving the PR! Can you help to merge it if there is no objection?

Yes, I'll merge in these changes tonight. Was waiting a day to allow people to give feedback on the latest revision if they want to.

@github-actions github-actions bot modified the milestones: 11.0.0, 10.3.0 Jun 28, 2025
@vigyasharma vigyasharma merged commit 3404496 into apache:main Jun 28, 2025
@vigyasharma
Contributor

@dungba88 – Can this change go in 10.3 instead of waiting for 11.0? I didn't see anything blocking so I updated the changes entry. But I'm running into some merge issues while backporting, likely because of the DocAndScoreQuery refactor?

If you would like, raise a separate PR for the 10.3 backport (against branch_10x) and I can help review it. Or if this can only go in 11.0, then I'll revert the changes entry. Let me know.

@dungba88
Contributor Author

It should be possible to backport to 10.3. I'll raise a PR. Thanks @vigyasharma for merging!

@dungba88
Contributor Author

I put a backport PR to 10.3 here: #14860
